## [1] 113937 81
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
ListingKey is a unique key for each listing. I wonder if all records have unique ListingKeys, let’s check…
## [1] 1698
## [1] "ProsperScore"
## ListingKey ListingNumber ListingCreationDate
## 32681 00223594917038064A7C947 998257 2013-11-15 16:58:37.167000000
## 32682 00223594917038064A7C947 998257 2013-11-15 16:58:37.167000000
## 17275 00473590513960687DD308F 941296 2013-10-07 15:47:36.023000000
## 32965 00473590513960687DD308F 941296 2013-10-07 15:47:36.023000000
## 7479 0098360461900952056DB93 1190614 2014-03-02 14:21:39.583000000
## 33221 0098360461900952056DB93 1190614 2014-03-02 14:21:39.583000000
## ProsperRating..Alpha. ProsperScore
## 32681 C 5
## 32682 C 6
## 17275 C 6
## 32965 C 4
## 7479 E 2
## 33221 E 4
There are 1698 duplicate records and looks like only ProsperScore is unique among the duplicate records.
## [1] 871
Removing the ProsperScore, I got 871 duplicate records. I’m thinking of focusing my analysis on ProsperRating and will not be looking into ProsperScore. So from the duplicate records, I’ll retain the first entry for each ListingKey and filter out the rest. Also, ProsperRating is only applicable for loans originated after July 2009, so I will filter out pre-July 2009 loans as well.
## [1] TRUE
## 1 2 3 4 5 6 7 NA's
## 6916 9715 14168 18090 15368 14388 5321 127
Majority of the loans have a ProsperRating of 4, which is an average rating. The best rating, 7, has only over 5000.
I won’t include NA ProsperRating in my analysis, so I will remove them from our sample data.
## 600 620 640 660 680 700 720 740 760 780 800 820
## 1038 1652 8746 13943 13839 13453 10905 7815 5213 3688 2098 1039
## 840 860 880
## 398 122 17
The majority of the borrowers have a CreditScore from 660 to 700.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3427 5000 5931 7083 1750000
There’s more than 1 million difference from 3rd quantile to the max. It’s also hard to see the distribution…
##
## 394400 416666.666667 466666.666667 483333.333333 618547.833333
## 1 1 1 1 1
## 1750002.916667
## 1
##
## 0 0.083333 0.25 1.416667 1.666667 1.833333
## 693 5 1 1 1 2
It’s hard to see the distribution, especially with an outlier in the millionth point. Using log10 certainly helps to see that majority of the borrowers have a monthly income of around $5000. I find the millionth monthly income strange… and there’s only 1. The next income below that outlier is $618,547. Another strange thing are the monthly income that are less than $1. Hmmm, curiouser and curiouser…
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3700 6356 8740 11600 72500 64342
The PrincipalBorrowed distribution is right-skewed, with median in $6356 and maximum in $72500. There’s a peak around $5,000.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1140 4568 11400 13900 498400
The AvailableBankcardCredit’s mean, $11400, is more than twice the median value, $4564. However, looking at the figure above, it is hard to see the median and the mean values but I think it is somewhere within the peak. I’ll transform the x-axis to Log10.
Now, I can definitely see the peak; and the red dashline, which are the median and mean value, respectively, surround the two peaks.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
##
## 12 36 60
## 1613 58125 24228
There are 3 terms: 12 months, 36 months, and 60 months. Majority of the borrowers choose the 36 month term loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9062 13500 35000
In the LoanOriginalAmount distribution, you can clearly see the peaks for every $5,000 segment. So for $1,000-$5,000, the peak is $4,000. Likewise, for $26,000 to $30,000, the peak is $30,000. There seem to be a trend in the peaks… 10K, 20K, 25K, 30K, 35K. Do people just like a nice number, divisible by 5? But why 4K? Hmmm, is this worth exploring… or not.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.1259 0.1775 0.1863 0.2474 0.3400
There’s a big peak in 0.30. I wonder what loan amount and term gives this yield.
Now, I’ll look at the next 6 variables which I will refer to as Bad Records and grouped them into three: Inquiries, Delinquencies and Public Records. I’ll also show their summaries, where the first one will pertain to the first plot while the second summary will pertain to the second plot.
I’ll look at the Inquiries first.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.9644 1.0000 27.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.285 6.000 78.000
Both distribution are right-skewed. In the first distribution, majority have zero Inquiries for the last 6 months. While in the second distribution, majority of the loans have had atleast 2 Inquiries.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3236 0.0000 51.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.659 2.000 99.000
Similar to Inquiries, the distribution for Delinquencies are also right-skewed with majority of the loans having zero delinquencies. This is backed by their median values which is zero.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.009254 0.000000 20.000000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.285 0.000 38.000
Again, the distribution for Public Records is right-skewed with majority of the loans having zero public records.
Among the bad record variables, Public Records has the least number, with maximum for the last 10 years is 38 compared to Total Inquiries with 78 and Delinquencies for the past 7 years as 99. This makes sense as Public Records pertains to court order or bankruptcies. With such rare occurence, I wonder what influence the Public Record have on ProsperRating. I’ll look at this in the Bivariate section.
All in all, the distribution for the bad record variables are right-skewed with majority of the data in or near zero point. This clearly shows that most of the borrowers do not have bad record. This is good to know for Prosper investors.
Originally, there are 113,937 loan records with 81 features or variables. However, analyzing 81 features is difficult. So I’ve chosen 14 features and categorized them into 4 groups: Ratings, Borrower Information, Loan Information, and Bad Records.
These are:
Ratings:Furthermore, the 113,937 records was trimmed down to 83,966; loans that were made prior to July 2009, those with duplicate Listing Keys and those records with NA ProsperRating were removed.
Observations:I want to look at the data in behalf of a potential investor, thus, I chose ProsperRating as the main feature of interest. ProsperRating is the rating set on the loan listing and it tells a potential investor the potential risk of a loan. I would like to know what are considered risk and its relationship to ProsperRating.
Note that I did not include ProsperScore in my analysis although ProsperScore is one of the main features that is used to calculate ProsperRating. This is because ProsperScore is calculated based on prosper loan history. It tells the risk associated with a borrower. Since I would be including other prosper loan historical details that are maybe used in ProsperScore computation, I thought it would be redundant to include ProsperScore as well.
I created four variables: LoanOrigDate, CurrentBadRecords, LoanAmtByTerm and CreditScoreRange.
I created LoanOrigDate in order to remove July 2009 records.
CurrentBadRecords is the sum of InquiriesLast6Months, CurrentDelinquencies , and PublicRecordsLast12Months.
The LoanAmtByTerm is Loan Amount per Month.
CreditScoreRange, which has 5 values, is a smaller version of CreditScoreRangeLower, which has 15 values.
Below is the mapping for CreditScoreRange:
CreditScoreRangeLower CreditScoreRange
600, 620, 640 600-659
660, 680 660-699
700, 720 700-739
740, 760, 780 740-799
800, 820, 840, 860 800-899
The frequency of data for each of the CreditScoreRangeLower is considered in deciding the appropriate partitioning used in the mapping. The reason for creating this variable is for visualization purposes.
I log transformed the right-skewed StatedMonthlyIncome in order to better see the distribution. The transformed distribution became normal, peaking at $5,000 monthly income. I find unusual the outlier monthly income at $1,750,002 and the income at less than $1. I also log transformed the AvailableBankcardCredit to also better see the distribution and found two peaks that are near or in-between its median and mean value.
I checked whether ListingKey is unique and I found that those records with duplicate ListingKey have duplicate data for all the variables except for ProsperScore. I find this odd and I think there may have been some input error. Since, I will not be analyzing ProsperScore, I’ve decided to only include the first entries for each duplicate record and filter out the rest.
I also filtered out records for loans created before July 2009. This is because ProsperRating is only applicable after July 2009. Furthermore, I’ve also removed the records with NA ProsperRating, since I am focusing my investigation on ProsperRating.
I’ve created a data frame that contains the summaries (quantile values) per ProsperRating of each of the variables or features. This is done to create a summary plot.
In the correlation matrix, note that I renamed the variables to a shorter version to make the plot larger.
## rating cscore income principal bankcard loanamount term
## rating 1.00 0.55 0.26 0.24 0.47 0.45 0.08
## cscore 0.55 1.00 0.15 0.23 0.55 0.29 0.07
## income 0.26 0.15 1.00 0.30 0.23 0.41 0.07
## principal 0.24 0.23 0.30 1.00 0.27 0.42 0.14
## bankcard 0.47 0.55 0.23 0.27 1.00 0.34 0.06
## loanamount 0.45 0.29 0.41 0.42 0.34 1.00 0.37
## term 0.08 0.07 0.07 0.14 0.06 0.37 1.00
## yield -0.96 -0.51 -0.26 -0.22 -0.46 -0.42 0.01
## inquiry6mos -0.26 -0.07 0.11 0.01 0.04 -0.10 -0.05
## inquirytotal -0.17 -0.14 0.18 0.01 0.04 -0.02 -0.01
## delinq -0.18 -0.20 -0.03 -0.12 -0.16 -0.15 -0.05
## delinq7yrs -0.21 -0.30 -0.04 -0.13 -0.28 -0.15 -0.04
## pubrec12mos -0.05 -0.04 0.00 -0.03 -0.04 -0.03 -0.02
## pubrec10yrs -0.15 -0.28 -0.09 -0.09 -0.26 -0.11 -0.01
## yield inquiry6mos inquirytotal delinq delinq7yrs pubrec12mos
## rating -0.96 -0.26 -0.17 -0.18 -0.21 -0.05
## cscore -0.51 -0.07 -0.14 -0.20 -0.30 -0.04
## income -0.26 0.11 0.18 -0.03 -0.04 0.00
## principal -0.22 0.01 0.01 -0.12 -0.13 -0.03
## bankcard -0.46 0.04 0.04 -0.16 -0.28 -0.04
## loanamount -0.42 -0.10 -0.02 -0.15 -0.15 -0.03
## term 0.01 -0.05 -0.01 -0.05 -0.04 -0.02
## yield 1.00 0.25 0.15 0.19 0.20 0.05
## inquiry6mos 0.25 1.00 0.53 0.04 0.08 0.01
## inquirytotal 0.15 0.53 1.00 0.03 0.10 0.01
## delinq 0.19 0.04 0.03 1.00 0.37 0.06
## delinq7yrs 0.20 0.08 0.10 0.37 1.00 0.07
## pubrec12mos 0.05 0.01 0.01 0.06 0.07 1.00
## pubrec10yrs 0.14 0.05 0.08 0.11 0.36 0.18
## pubrec10yrs
## rating -0.15
## cscore -0.28
## income -0.09
## principal -0.09
## bankcard -0.26
## loanamount -0.11
## term -0.01
## yield 0.14
## inquiry6mos 0.05
## inquirytotal 0.08
## delinq 0.11
## delinq7yrs 0.36
## pubrec12mos 0.18
## pubrec10yrs 1.00
The strongest correlation is ProsperRating vs LenderYield at -0.96, next is CreditScore at 0.55, followed by BankcardCredit and LoanAmount. Other variables that show atleast 20% positive correlation are StatedMonthlyIncome and PrincipalAmount. Inquiry6Months and Delinquencies7Yrs also show atleast 20% correlation, but in the negative. All the bad record variables are in the negative correlation.
I am wondering about LenderYield, it has an almost perfect correlation with ProsperRating. So, I looked into Prosper website and found out that LenderYield is the borrower’s interest rate less lender’s servicing fee. And the Borrower’s rate is I think based on ProsperRating. In short, LenderYield may be dependent on ProsperRating. I will not be including this in my analysis because it is not a predictor variable for ProsperRating.
## prosper$ProsperRating: 1
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 219 290 1049 1563 1466 1217 554 304 162 57 23 10 1 1 0
## --------------------------------------------------------
## prosper$ProsperRating: 2
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 571 782 2442 2431 1747 851 421 234 143 63 19 7 2 2 0
## --------------------------------------------------------
## prosper$ProsperRating: 3
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 214 503 2669 3147 2452 2075 1500 854 455 189 74 32 3 1 0
## --------------------------------------------------------
## prosper$ProsperRating: 4
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 18 42 1807 4063 3959 3621 2338 1117 580 358 136 38 9 3 1
## --------------------------------------------------------
## prosper$ProsperRating: 5
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 16 35 650 1950 2638 3177 2798 2164 1210 443 194 71 17 5 0
## --------------------------------------------------------
## prosper$ProsperRating: 6
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 0 0 129 789 1570 2284 2709 2413 1896 1513 685 285 98 15 2
## --------------------------------------------------------
## prosper$ProsperRating: 7
## 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880
## 0 0 0 0 7 228 585 729 767 1065 967 596 268 95 14
The ProsperRating vs CreditScore plot shows an increasing trend. The data points gets bigger as it moves laterally along the CreditScoreRange and diagonally along the ProsperRating. Majority of the points are in the middle of the CreditScoreRange. There are no datapoints in the lowest CreditScore range for the top 2 ProsperRatings, 6th and 7th; and no datapoints in the highest CreditScore for the bottom 3: 1st, 2nd and 3rd.
I want to look at the ProsperRating density plot by CreditScoreRange to better see the peaks in the CreditScore for each ProsperRating. This time, I will use 5 groups instead of the 15 CreditScoreRangeLower values.
In this plot, I can clearly see in how CreditScore peaks changes as ProsperRating improves. The lowest CreditScore peaks in the bottom 3 ProsperRating while the highest CreditScore peaks in the top two ProsperRating.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3871 4852 5774 1750000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2917 4167 5023 6083 483300
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3083 4333 5185 6250 618500
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3500 5000 5825 7000 394400
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3917 5417 6442 7750 108800
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4167 5833 6744 8167 91670
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4833 6667 7663 9167 108300
I’ve removed the top 5% of the StatedMonthlyIncome data in order to better see the relationship between Income and Rating. What I notice immediately are the vertical lines that tells me income is quite constant throughout the rating range. Even with overplotting, I can still distinguish the vertical lines. So, same income can be seen in the bottom rating up to the top rating but there is overplotting in the middle of the income range compared to its lower and higher range. In the 1st rating, although I still see the vertical lines in the higher income range, there are less of them and as the rating goes up, more points can be seen in that side than in the lower side of the income range.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 2500 4500 6375 8000 44800 5635
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 3000 4550 6502 8000 50900 7155
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 3000 5000 7130 9000 54220 10827
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 4000 6381 8643 11500 72500 14624
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 4000 7500 9418 13000 53200 12112
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 4137 8000 10480 15000 67000 10018
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 5000 10000 12200 17100 55900 3971
There area a lot of outliers, but I can clearly see that the PrincipalBorrowed median increases as the rating increases.
## Warning: Removed 2420 rows containing non-finite values (stat_boxplot).
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 194 1233 4363 4492 150500
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 332 1500 4223 4546 395500
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 597.8 2430.0 6049.0 6902.0 412800.0
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1154 3914 8317 10020 373300
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2042 6262 12070 15650 360000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4344 11620 18870 25490 406100
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 15030 28140 36230 48560 498400
There are more outliers in the beginning of the AvailableBankcardCredit range than at the end. Same with PrincipalBorrowed, the AvailableBankcardCredit median also increases as rating increases. However, there is a big difference in the median of the top ProsperRating compared with the rest; 7th rating is 59% more than the 6th rating.
## prosper$ProsperRating: 1
##
## 36
## 6916
## --------------------------------------------------------
## prosper$ProsperRating: 2
##
## 12 36 60
## 288 7556 1871
## --------------------------------------------------------
## prosper$ProsperRating: 3
##
## 12 36 60
## 295 9756 4117
## --------------------------------------------------------
## prosper$ProsperRating: 4
##
## 12 36 60
## 264 10041 7785
## --------------------------------------------------------
## prosper$ProsperRating: 5
##
## 12 36 60
## 316 8888 6164
## --------------------------------------------------------
## prosper$ProsperRating: 6
##
## 12 36 60
## 269 10526 3593
## --------------------------------------------------------
## prosper$ProsperRating: 7
##
## 12 36 60
## 181 4442 698
Looking at the density plot above, the 12th and 36th Term is quite constant throughout the ProsperRating. Also, the lowest ProsperRating only have 36 term. For the 60 Term, there’s a peak in the 4th rating.
On another note, the correlation of Term with ProsperRating is weak at 0.08. On its own, it has a weak influence and I am planning on using it with LoanOriginalAmount to get the Loan value per Month.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4000 3463 4000 16800
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3522 4000 4576 5000 15900
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6001 7075 10000 15000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5000 10000 10380 15000 25000
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 10000 11600 15000 35000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5800 10000 11450 15000 35000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5900 10570 11570 16000 35000
I didn’t expect to see a clear trend between ProsperRating and LoanAmount, but the plot definitely shows that as loan amount increases, the rating increases. I am equating ProsperRating to risk, so I thought that smaller loan amount would have lower risk, and thus should have high rating. But that is not being shown here. Hmmm…
The median loan amount of the 1st, 2nd, and 3rd ratings are way smaller compared with the top 4 ratings. There’s is a 40% difference in the median loan amount between the 3rd and 4th rating. I can see a clear separation of the bottom 3 ratings from the rest.
Now, let’s look at the bad record variables. All these have negative correlations with ProsperRating. Similar to the Univariate section, I will group these variables into three: Inquiries, Delinquencies and Public Records.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.853 3.000 22.000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.448 2.000 27.000
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.099 2.000 12.000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.9104 1.0000 12.0000
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7213 1.0000 11.0000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5897 1.0000 12.0000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4676 1.0000 7.0000
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.000 5.825 8.000 66.000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 5.355 7.000 71.000
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 4.491 6.000 78.000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 3.00 4.28 6.00 44.00
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.803 5.000 32.000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.561 5.000 28.000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.148 4.000 36.000
In both plots, I can see a decreasing trend. The topmost Rating has the lesser amount of datapoints compared to the lower Ratings. This is backed by the median value for both variables; the median decreases as rating increases.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5844 1.0000 22.0000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5996 1.0000 32.0000
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4216 0.0000 21.0000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2896 0.0000 21.0000
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2334 0.0000 51.0000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1546 0.0000 21.0000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.053 0.000 10.000
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.947 5.000 99.000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 5.779 6.000 99.000
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.815 4.000 99.000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.903 3.000 99.000
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.084 1.000 99.000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.946 0.000 99.000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.508 0.000 71.000
The median for both variables is zero for each ProsperRating. For the CurrentDelinquencies, the 3rd quantiles for Rating 1 and 2 is higher by 1 point compared to the rest of the PropserRating, which is zero. For DelinquenciesLast7Years, the 2nd ProsperRating has a higher 3rd quantile value than the 1st ProsperRating.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01764 0.00000 3.00000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01379 0.00000 4.00000
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01369 0.00000 4.00000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.008734 0.000000 20.000000
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.006442 0.000000 4.000000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.004379 0.000000 4.000000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.001316 0.000000 2.000000
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3242 1.0000 11.0000
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4072 1.0000 34.0000
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3518 1.0000 16.0000
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3165 1.0000 25.0000
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2672 0.0000 16.0000
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1802 0.0000 38.0000
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06108 0.00000 8.00000
I mentioned in the Univariate section that the occurence of PublicRecords is rare and this is definitely shown in the plots above where the quantiles are mostly in the zero point. Furthermore, the correlation with ProsperRating for PublicRecordsLast12Months and PublicRecordsLast10Years is -0.05 and -0.15, respectively. This definitely show a weak relationship which tells me that PublicRecord variable have less influence on the ProsperRating. I’ll check this more in the Linear Regression.
ProsperRating correlates the strongest with CreditScore, followed by AvailableBankcardCredit and LoanAmount.
In the CreditScore vs ProsperRating, an increasing trend is seen; as CreditScore increases so does the ProsperRating.
In the StatedMonthlyIncome vs ProsperRating, majority of the datapoints are in the middle of the Income range. There are lots of vertical lines which suggest that income is fairly constant and the same income can be seen in all of the ratings. The difference in the median between each ProsperRating is not that high; the biggest difference is from ProsperRating 6th to 7th which is 834. Nevertheless, the median Income still increases as ProsperRating improves.
Similar with StatedMonthlyIncome, both the median value of ProsperPrincipalBorrowed and AvailableBankcardCredit increases as ProsperRating improves. Furthermore, the median of the AvailableBankcardCredit for the 7th Rating is more than twice the median of the 6th Rating.
The LoanOriginalAmount median also increases as ProsperRating increases. Lower loan amount have lower ProsperRating than loans with higher amount.
All inquiries, delinquencies and public record features have a negative relationship with ProsperRating; as Rating increases, frequency of bad records decreases.
The AvailableBankcardCredit correlates well with CreditScore at 0.55, which makes sense as they are both related to a borrower’s credit standing.
Both ProsperPrincipalBorrowed and StatedMonthlyIncome correlates more strongly with LoanOriginalAmount at 0.42 and 0.41, respectively than with ProsperRating. The LoanOriginalAmount eventually becomes ProsperPrincipalBorrowed, so its strong correlation is expected. For StatedMonthlyIncome, I guess the amount a borrower loans is a proportion of his income; the higher his income, the higher the amount that he can loan.
The CurrentDelinquencies, DelinquenciesLast7Years and PublicRecordsLast10Years have a stronger correlation with CreditScore compared with ProsperRating, at -0.20, -0.30, and -0.28, respectively. So it seems that these variables have more influence on the CreditScore than with ProsperRating; unlike with Inquiries where it has greater correlation or influence with ProsperRating than CreditScore.
CreditScore shows the strongest relationship with ProsperRating at positive correlation of 0.55, followed by AvailableBankcardCredit at 0.47 and LoanOriginalAmount at 0.45.
There are 4 main loan feature with positive correlation, not including Term. These are StatedMonthlyIncome, AvailableBankcardCredit, LoanOriginalAmount, and ProsperPrincipalBorrowed. First, I want to see the relationship of these variables with ProsperRating and CreditScore. Later, I will look at how these positive-correlated variables relate to the Bad Record variables.
I removed the top 5% and the relationship that pops out more is ProsperRating vs CreditScore where higher ratings have higher CreditScore. The StatedMonthlyIncome is quite constant where the same vertical lines can be seen in all the ProsperRating. What I can conclude in this plot is that CreditScore influences the ProsperRating more than the StatedMonthlyIncome.
Unlike the StatedMonthlyIncome, I can see the influence of AvailableBankcardCredit with CreditScore and ProsperRating. Lower CreditScore can be seen in the lower range of BankcardCredit and higher CreditScore is seen in higher BankcardCredit. Furthermore, the BankcardCredit variance increases as the ProsperRating increases.
Here, I can see that lower ProsperRating have lower CreditScore and higher ProsperRating have higher CreditScore.
Now, I want to look at the Loan per Month by dividing LoanOriginalAmount by Term. There is a variable that actually depicts what the borrower pay per month; this is MonthlyLoanPayment. I actually thought of using this instead of Loan Amount but I don’t want to add any bias that may incur because of rates applied to the loan.
## prosper$ProsperRating: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 83.33 111.10 96.19 111.10 466.70
## --------------------------------------------------------
## prosper$ProsperRating: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 69.44 111.10 122.10 138.90 833.30
## --------------------------------------------------------
## prosper$ProsperRating: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 97.22 152.20 171.20 236.10 1250.00
## --------------------------------------------------------
## prosper$ProsperRating: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 125.00 216.70 232.30 277.80 2083.00
## --------------------------------------------------------
## prosper$ProsperRating: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 142.40 250.00 262.80 333.30 2083.00
## --------------------------------------------------------
## prosper$ProsperRating: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 146.00 250.00 283.00 416.30 2083.00
## --------------------------------------------------------
## prosper$ProsperRating: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.78 166.70 277.80 308.60 416.70 2083.00
The plot above is similar to the LoanOriginalAmount vs ProsperRating by CreditScore. High ProsperRating have high CreditScore. Furthermore, the median amount for LoanAmtByTerm also increases as ProsperRating increases.
## Warning: Removed 64342 rows containing missing values (geom_point).
Again, I can see the relationship of ProsperRating and CreditScore; higher ProsperRating have higher CreditScore. This relationship is apparent in all the positive-correlated plots above.
Now, I want to see the relationship of the negative-correlated features, which I call Bad Record variables, with CreditScore. I am thinking of just using the bad record values incurred recently rather than the Overall Total Bad Record values. I think that the current values have more impact than the overall total. Among the bad record variables, InquiriesLast6Months have the strongest correlation at -0.26, next is -0.21 for Delinquency7Years followed by CurrentDelinquencies at -0.18. There is just a .03 difference in the Delinquency correlation values, so I think it would be enough just to use the recent/current values rather than using the overall total. Before, I disregard the past inquires, delinquencies and public records, let’s compare their R-squared value to check if its enough to only use the current bad record values.
##
## Calls:
## m1: lm(formula = as.numeric(ProsperRating) ~ I(-InquiriesLast6Months -
## CurrentDelinquencies - PublicRecordsLast12Months), data = prosper)
## m2: lm(formula = as.numeric(ProsperRating) ~ I(-TotalInquiries -
## CurrentDelinquencies - DelinquenciesLast7Years - PublicRecordsLast12Months -
## PublicRecordsLast10Years), data = prosper)
##
## ===================================================================================================================================================
## m1 m2
## ---------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 4.417*** 4.349***
## (0.007) (0.007)
## I(-InquiriesLast6Months - CurrentDelinquencies - PublicRecordsLast12Months) 0.268***
## (0.003)
## I(-TotalInquiries - CurrentDelinquencies - DelinquenciesLast7Years - PublicRecordsLast12Months - PublicRecordsLast10Years) 0.033***
## (0.001)
## ---------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.086 0.046
## adj. R-squared 0.086 0.046
## sigma 1.602 1.637
## F 7897.006 4016.836
## p 0.000 0.000
## Log-likelihood -158698.926 -160510.812
## Deviance 215424.559 224925.309
## AIC 317403.852 321027.624
## BIC 317431.866 321055.638
## N 83966 83966
## ===================================================================================================================================================
Look’s like having the total bad records have lesser influence on the result of the ProsperRating. The current bad records have 0.086 R-squared compared to 0.046 with total bad records. So, I will just use the Current Bad Record values. Also, I would just like to compare the positive-correlated variables with the combination or the total Bad Record values. So, I will add them up to one variable named CurrentBadRecords.
## Warning: Removed 17739 rows containing missing values (geom_point).
## prosper$CreditScoreRange: 600-659
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.908 3.000 34.000
## --------------------------------------------------------
## prosper$CreditScoreRange: 660-699
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.416 2.000 24.000
## --------------------------------------------------------
## prosper$CreditScoreRange: 700-739
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.145 2.000 51.000
## --------------------------------------------------------
## prosper$CreditScoreRange: 740-799
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.006 1.000 22.000
## --------------------------------------------------------
## prosper$CreditScoreRange: 800-899
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8345 1.0000 10.0000
I have trimmed down the CurrentBadRecords to 35 to better see the plot, as only CreditScoreRange 700-739 have a BadRecord count higher than 35. Majority of lowest ProsperRating can be found in the lowest CreditScoreRange and as CreditScore improves, ProsperRating improves and Bad Record count decreases. Furthermore, majority of loans with high Bad Record count have lower ProsperRating.
Now, let’s look at the relationship of the positive-correlated features with the CurrentBadRecord variable. Let’s start with Loan Amount by Term, faceted with CreditScoreRange.
Here, I can see smaller loan term value tend to have higher inquiries, delinquencies and public records. Furthermore, lower loan value also have low ProsperRating and more lower ProsperRating can be found in the lower CreditScore.
I remember that I was surprised to find out that lower Loan Amount have lower ProsperRating, and this plot certainly support this relationship. So, there is more risk in investing in lower loan amounts.
Again here, I can see that high AvailableBankcardCredit tend to have high ProsperRating, at the same time, they also have lower inquiries, delinquencies, and public records. Also, I can see that as CreditScore increases, the variance in the BankcardCredit also increases. Once more, low ProsperRating have higher risk; low ratings have more bad records associated with it and they tend to have lower AvailableBankcardCredit and CreditScore as well.
This plot is showing 95% of the StatedMonthlyIncome datapoints, removing the 1 million outlier. I can clearly see that the low ProsperRating shows more at the lower CreditScore and those with higher bad records also have lower PropserRating.
## Warning: Removed 6206 rows containing missing values (geom_point).
## Warning: Removed 22069 rows containing missing values (geom_point).
## Warning: Removed 19735 rows containing missing values (geom_point).
## Warning: Removed 13361 rows containing missing values (geom_point).
## Warning: Removed 2971 rows containing missing values (geom_point).
Again here, most lower ProsperRating appear at lower CreditScore and they have higher Bad Records. Also, most of the lower ProsperRating appear at the lower-end of the ProsperPrincipalBorrowed range. There is more risk in investing to loans which have lower previous loans.
I remember that LoanOriginalAmount have a higher correlation with ProsperPrincipalBorrowed than with ProsperRating. This makes sense as LoanAmount eventually becomes ProsperPrincipalBorrowed. Lower LoanAmount also have lower ProsperRating, same with lower PrincipalProsperBorrowed which also have lower ProsperRating.
What the previous plots have shown is that low income, low loan value, low bankcard credit and low principal borrowed have a higher risk. Higher risk means higher number of bad records and low ProsperRating. They all displayed the same trend and relationship with ProsperRating. So I think we can use linear model to predict ProsperRating.
But first, let’s look at the summaries of all the features for each ProsperRating.
Here, we can see the movement of the values for each ProsperRating at different quantiles. Generally, the movement of the values from 25% to 75% quantile is increasing at the same time lower ProsperRating have lower value compared to higher ProsperRating. For the bankcard, income, loanAmt, loanByTerm, and principal, the top ProsperRating has the biggest value.
Looking at the income at 100% quantile, the lowest ProsperRating has the biggest income; see how the line moves from being the lowest in the 3rd quantile to becoming the biggest value in the 100% quantile. This is due to the 1.75million outlier in the StatedMonthlyIncome distribution. This just goes to show that you cannot just look at one area to get the whole picture, especially when the data came from an outlier and not a good representative of the distribution.
In another note, with regards to the Current Bad Record plot; this shows that the top ProsperRating has the lowest value at each quantile compared to the other ProsperRatings.
Now, let’s do our model for predicting ProsperRating. I am going to add in the individual features, except for LoanOriginalAmount where I will apply the LoanValue per Term.
##
## Calls:
## m1: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower,
## data = prosper)
## m2: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term), data = prosper)
## m3: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term) + AvailableBankcardCredit, data = prosper)
## m4: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed,
## data = prosper)
## m5: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed +
## StatedMonthlyIncome, data = prosper)
## m6: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed +
## StatedMonthlyIncome + I(-InquiriesLast6Months), data = prosper)
## m7: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed +
## StatedMonthlyIncome + I(-InquiriesLast6Months) + I(-CurrentDelinquencies),
## data = prosper)
## m8: lm(formula = as.numeric(ProsperRating) ~ CreditScoreRangeLower +
## I(LoanOriginalAmount/Term) + AvailableBankcardCredit + ProsperPrincipalBorrowed +
## StatedMonthlyIncome + I(-InquiriesLast6Months) + I(-CurrentDelinquencies) +
## I(-PublicRecordsLast12Months), data = prosper)
##
## ==============================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8
## ------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 4.711*** 3.952*** 3.746*** 4.375*** 4.330*** 4.438*** 4.455*** 4.455***
## (0.025) (0.025) (0.025) (0.084) (0.084) (0.080) (0.080) (0.080)
## CreditScoreRangeLower: .L 5.926*** 5.155*** 4.370*** 4.947*** 4.937*** 4.445*** 4.383*** 4.380***
## (0.153) (0.144) (0.144) (0.505) (0.504) (0.484) (0.483) (0.483)
## CreditScoreRangeLower: .Q -0.729*** -0.497** -0.806*** -1.511** -1.516** -1.602** -1.579** -1.578**
## (0.166) (0.156) (0.155) (0.559) (0.559) (0.536) (0.535) (0.535)
## CreditScoreRangeLower: .C -0.299 -0.265 -0.336* -0.105 -0.095 -0.152 -0.167 -0.167
## (0.158) (0.148) (0.147) (0.537) (0.536) (0.514) (0.514) (0.514)
## CreditScoreRangeLower: ^4 -0.067 -0.043 -0.115 0.126 0.151 0.113 0.118 0.118
## (0.139) (0.131) (0.130) (0.467) (0.467) (0.448) (0.447) (0.447)
## CreditScoreRangeLower: ^5 -0.063 0.004 -0.068 -0.142 -0.107 -0.132 -0.133 -0.133
## (0.118) (0.111) (0.110) (0.376) (0.376) (0.361) (0.360) (0.360)
## CreditScoreRangeLower: ^6 0.186 0.157 0.114 0.291 0.318 0.319 0.321 0.321
## (0.096) (0.090) (0.090) (0.285) (0.285) (0.273) (0.272) (0.272)
## CreditScoreRangeLower: ^7 -0.041 -0.038 -0.075 -0.026 -0.002 -0.001 -0.001 -0.001
## (0.077) (0.073) (0.072) (0.207) (0.207) (0.198) (0.198) (0.198)
## CreditScoreRangeLower: ^8 0.108 0.111 0.097 0.154 0.174 0.196 0.198 0.198
## (0.062) (0.058) (0.058) (0.148) (0.148) (0.142) (0.142) (0.142)
## CreditScoreRangeLower: ^9 -0.083 -0.059 -0.065 0.036 0.046 0.066 0.065 0.066
## (0.049) (0.046) (0.046) (0.108) (0.108) (0.103) (0.103) (0.103)
## CreditScoreRangeLower: ^10 0.042 0.045 0.044 0.055 0.057 0.064 0.064 0.064
## (0.039) (0.037) (0.037) (0.080) (0.080) (0.077) (0.077) (0.077)
## CreditScoreRangeLower: ^11 0.116*** 0.110*** 0.108*** 0.214*** 0.213*** 0.219*** 0.219*** 0.219***
## (0.031) (0.029) (0.029) (0.061) (0.061) (0.059) (0.059) (0.059)
## CreditScoreRangeLower: ^12 0.022 0.028 0.028 0.034 0.035 0.028 0.028 0.028
## (0.025) (0.024) (0.024) (0.048) (0.048) (0.046) (0.046) (0.046)
## CreditScoreRangeLower: ^13 -0.042* -0.021 -0.020 0.071 0.070 0.065 0.065 0.066
## (0.021) (0.019) (0.019) (0.039) (0.039) (0.038) (0.038) (0.038)
## CreditScoreRangeLower: ^14 -0.025 -0.020 -0.019 -0.000 0.001 -0.003 -0.003 -0.003
## (0.017) (0.016) (0.016) (0.033) (0.033) (0.031) (0.031) (0.031)
## I(LoanOriginalAmount/Term) 0.003*** 0.003*** 0.002*** 0.002*** 0.002*** 0.002*** 0.002***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## AvailableBankcardCredit 0.000*** 0.000*** 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## ProsperPrincipalBorrowed 0.000*** 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## StatedMonthlyIncome 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000)
## I(-InquiriesLast6Months) 0.202*** 0.202*** 0.202***
## (0.005) (0.005) (0.005)
## I(-CurrentDelinquencies) 0.048*** 0.048***
## (0.006) (0.006)
## I(-PublicRecordsLast12Months) 0.067
## (0.043)
## ------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.303 0.383 0.392 0.530 0.531 0.569 0.570 0.570
## adj. R-squared 0.303 0.383 0.392 0.530 0.531 0.568 0.570 0.570
## sigma 1.398 1.316 1.306 1.172 1.171 1.123 1.121 1.121
## F 2612.250 3481.109 3389.772 1301.593 1233.484 1360.465 1299.553 1237.869
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -147291.529 -142167.300 -141548.957 -30943.710 -30925.677 -30104.948 -30074.008 -30072.821
## Deviance 164168.955 145305.912 143181.469 26911.023 26861.611 24706.164 24628.380 24625.402
## AIC 294615.059 284368.601 283133.914 61925.419 61891.354 60251.896 60192.015 60191.642
## BIC 294764.469 284527.350 283302.001 62075.225 62049.044 60417.471 60365.474 60372.986
## N 83966 83966 83966 19624 19624 19624 19624 19624
## ==============================================================================================================================
The overall R-squared is 0.57, which is good; not very strong but there is a 57% chance that our model will hold true for predicting the ProsperRating. Also, the difference in adding StatedMonthlyIncome is just 0.1%, also similar is CurrentDelinquencies. Also, looks like PublicRecordsLast12Months has no impact on ProsperRating at all.
Loans with higher ProsperRating have higher CreditScore and loans with lower ProsperRating have lower CreditScore. This relationship resonates in the features with positive correlation: LoanAmount per Term, StatedMonthlyIncome, AvailableBankcardCredit and ProsperPrincipalBorrowed.
Furthermore, higher ProsperRating have the higher medians in all the features, except for the features with negative correlation which are Inquiries, Delinquencies and Public Records.
On the other hand, loans with higher number of Inquiries, Delinquencies and Public Records have lower ProsperRating than loans with lower number of Inquiries, Delinquencies and Public Records.
Majority of loans with low ProsperRating exhibits low value in Loan Amount, Bankcard Credit, Income and Principal Borrowed, at the same time they also have the higher Inquiries, Delinquencies and Public Records, than loans with better ProsperRating.
I find interesting the relationship between Loan Amount and ProsperRating; lower Loan Amounts tend to have lower ProsperRating. I find this odd as I thought that there is low risk in investing to such loans. But low loan amounts also have low Credit Score and higher number of Inquiries, Delinquencies and Public Records.
Re-thinking about this relationship and the demographics of borrowers for low loan amounts, I realized that if you borrow a small amount, then that is probably the amount that you can pay. You may have a low Bankcard credit and low income, and you may not be able to go back to Prosper and borrow more. Furthermore, if you are low in cash, then the possibility of being delinquent, inspite of your loan being low, is high. Also, come to think about it, If I have two loans, one is big and the other is small. I would first take care of paying the higher loan than the lower loan because the rate of the higher loan is high.
In retrospect, now it makes sense the relationship of Loan Amount with ProsperRating. There are more risk in investing with lower loan amounts than with higher loan amounts.
Yes, I created a linear model with dependent variable as the ProsperRating and the predictors as the Credit Score, Loan Amount per Term, Income, Prosper Principal Borrowed, and the Current Inquiries, Delinquencies and Public Records.
The variables in the linear model only account for 57% of the variance in the Prosper Rating. This also means that there is a 57% chance that the model will hold true for predicting the Prosper Rating.
One of the model’s limitation is the huge amount of variables that are not analyzed for predicting ProsperRating. In fact, only 12 out of 80 (excluding Prosper Rating) were analyzed and only 9 were included in this model. Another limitation would be the type of regression used. Linear regression may not be the best model for predicting an ordinal variable, such as ProsperRating. There may be a better regression model to use, such as Ordinal Regression.
This plot shows how CreditScore peak changes as ProsperRating improves. The lowest CreditScore peaks at ProsperRating 1-3 and then gets lower as ProsperRating improves, there is even no 600-659 CreditScore in the topmost ProsperRating. Similarly, the highest CreditScore has the lowest density in the ProsperRating 1, then goes up as ProsperRating improves where it peaks at ProsperRating 6 and 7.
Loans with low ProsperRating have low CreditScores; and as CreditScore improves so does the ProsperRating.
## Warning: Removed 17882 rows containing missing values (geom_point).
CurrentBadRecords is the sum of the recent Inquiries, Delinquencies and Public Records. This plot shows the relationship of CreditScore and ProsperRating with Current Bad Record. There are more low ProsperRating (1 and 2) in the lower CreditScoreRange than in the higher CreditScoreRange. As CreditScore improves, bad record count decreases and ProsperRating improves. Majority of the loans with high Bad Record count have low ProsperRating.
Majority of the loans in the lower CreditScore range have lower ProsperRating and as CreditScore improves, so does the ProsperRating. Furthermore, lower ProsperRating have higher numbers of Current Bad Records (Inquiries, Delinquencies and Public Records) and these lower ProsperRating appear at the lower-end of the Loan value by Term. As the loan value gets bigger, bad record count decreases and ProsperRating improves.
Loan Amount and Term are two of the main properties of a loan. This plot clearly shows that lower Loan Value by Term have more risk than higher loan value.
The Prosper data set originally contains 113,937 records with 81 features and I was able to trim it down to 83,966 records; removing loans that were made prior to July 2009, those with duplicate ListingKeys and those with NA ProsperRating.
There are a lot of challenges I faced with this study. One is not being familiar with Prosper, so going through their website and reading information on Prosper listing certainly helps, especially in understanding the different features of the Prosper loan.
Another challenge is choosing which feature is the most interesting. The problem with Prosper is that it has many interesting features, so it is hard to choose which one is the most interesting, like ProsperScore, Borrower’s Rate, Lender Yield. However, what I mostly see in the Prosper website is the Prosper Rating. I also like the fact that Prosper Rating is a measurement of a Loan’s risk. So this made me interested on the different features that may make a loan risky.
The most difficult task I find in this study is choosing 10-15 features out of 81. There are some features that may refer to the same thing or the same concept. Like AvailableBankcardCredit, RevolvingCreditBalance and BankcardUtilization. The last two has something to do with how you use your credit card, but I chose the first one because it is the one that I am most familiar with. One premise I follow is to choose features that I am most familiar or those that I can easily understand. Another premise that I follow is that the features should come from independent observations or not computed within the Prosper environment. At first, I included ProsperScore but I realized that this is computed based on other features in the loan. Same thing with Lender Yield, which I included in the early days of this study. LenderYield has a 96% correlation with ProsperRating, so its almost perfect. It made me wonder if I should include it as it will drive my linear model. So, I read more on its definition from the website and concluded to remove it from my analysis as LenderYield’s value may be dependent on ProsperRating, not the other way around. So finally, I came up with 13 features, including ProsperRating, that I am happy with.
The ProsperRating risk relationship made sense to me when I got into the Multivariate section. Here, I can finally see how ProsperRating is a measure of risk in the Bad Record plots against the Loan Amount per Term, Principal Borrowed, Bankcard Credit and Income. These plots clearly shows Low Prosper Rating have higher risk. I especially like how the Bad Record vs Loan Value per Term validated the relationship shown in the plot for LoanOriginalAmount vs ProsperRating, wherein low loan amounts have lower Prosper Rating. The Bad Record vs Loan Value per Term supported the risk involved with lower loan amounts and somehow changed my perception on investing to lower loan amounts as oppose to investing to higher loan amounts in Prosper.
I would be interested in exploring Ordinal Regression and how it could be use to predict ProsperRating; finding out if it is a better model in predicting ProsperRating than Linear Regression. I would also be interested in looking at the other independent variables available in Prosper, which feature can improve the model and give greater measure of predicting Prosper Rating.